120 ◾ Bioinformatics
Notice that we added a RG header “@GR” with “ID” and “SM” to each SAM file to pro-
vide the sample information that will be used later by the variant caller to create a genotype
column for each sample.
4.2.1.1.5 Converting SAM into BAM format
The “bwa mem” aligner created a SAM file. We can convert the SAM files into a BAM for-
mat by running the following script. Make sure that you are running the script from the
main project directory.
mkdir bam
cd sam
for i in $(ls *.sam | rev | cut -c 5- | rev);
do
samtools view -uS -o ../bam/${i}.bam ${i}.sam
done
cd ..
The above script will create the subdirectory “bam” and convert the SAM files into BAM
files and stores them in the “bam” subdirectory.
4.2.1.1.6 Sorting and indexing BAM files
The BAM files must be sorted and indexed before the next step of the analysis. The follow-
ing bash script creates a new subdirectory “sortedbam”, sorts the bam files, and saves the
new BAM files with sorted alignment in the new subdirectory:
mkdir sortedbam
cd bam
for i in $(ls *.bam);
do
samtools sort -T ../sortedbam/tmp.sort -o ../sortedbam/${i} ${i}
done
cd ..
cd sortedbam
for i in $(ls *.bam);
do
samtools index ${i}
done
cd ..
4.2.1.1.7 Marking duplicates
The identical reads mapped to the reference genome may bias variant calling. The dupli-
cate reads are created by the PCR amplification during sequencing. Usually, using small
amount of DNA for library preparation may lead to more duplication. Some sequencing
protocols use Unique Molecular Identifiers (UMIs) to distinguish between the artifact and
natural duplicates. Marking or removing duplicate aligned reads is a common best practice